5. Introduction to deep learning

Big data in Economics

Juan D. Montoro-Pons | 2024/25

Artificial Neural Networks (ANN)

 

  • Artificial neural networks are universal function approximators: they can approximate any continuous function arbitrarily well.

  • ANNs consist of nodes, each of which performs a computation on its inputs, and layers, which are collections of nodes that share the same inputs.

  • While there are many variations of neural networks, the most common is the multi-layer perceptron or feed-forward neural network.

  • They can be applied to supervised learning (e.g. regression and classification), unsupervised learning, and reinforcement learning.

ANN form the basis of deep learning models.

Source: Introduction to Statistical Learning

Neural networks: structure

 

Structure of a feed-forward NN with one hidden layer:

  • Input layer: set of neurons \(X_1,X_2\ldots X_p\) (input features). Each input feeds into \(K\) neurons in a hidden layer
  • Hidden layer: set of \(K\) neurons that transform inputs. Each neuron produces a value or activation \(A_k\) computed as a function of the inputs \[A_k=h_k(X)=g(w_{k0}+\sum^p_{j=1}w_{kj}X_j)\] where \(g(z)\) is a nonlinear activation function.
  • The activations produced in the hidden layer feed into the output layer \[f(X)=\beta_0+\sum_k \beta_k A_k\] (A linear regression model in the \(K\) activations)

Neural networks: the model

 

A neural network takes a vector of \(p\) variables \(X=(X_1,X_2,\ldots X_p)\) to build a nonlinear function \(f(X)\) to predict some outcome \(Y.\)

The resulting NN model with one hidden layer can be summarized as

\[f(X)=\beta_0+\sum_k \beta_k A_k = \beta_0+\sum_k \beta_k h_k(X)= \beta_0+\sum_k \beta_k\, g(w_{k0}+\sum^p_{j=1}w_{kj}X_j)\] where all parameters \(\beta_0 \ldots \beta_K, w_{10}\ldots w_{Kp}\) are estimated from the data.
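The one-hidden-layer model above can be sketched as a forward pass in NumPy. The sizes \(p=4\), \(K=5\) and the parameter values here are random placeholders, not estimated ones, and ReLU, \(g(z)=\max(z,0)\), is used as the activation:

```python
import numpy as np

rng = np.random.default_rng(0)

p, K = 4, 5                   # input features, hidden units (illustrative sizes)
X = rng.normal(size=(10, p))  # 10 observations

# Hypothetical parameters; in practice these are estimated from data
w0 = rng.normal(size=K)       # hidden-layer intercepts w_{k0}
W = rng.normal(size=(K, p))   # hidden-layer weights w_{kj}
beta0 = 0.5                   # output intercept beta_0
beta = rng.normal(size=K)     # output-layer coefficients beta_k

def g(z):
    """ReLU activation g(z) = max(z, 0)."""
    return np.maximum(z, 0.0)

A = g(w0 + X @ W.T)           # activations A_k = g(w_{k0} + sum_j w_{kj} X_j)
f = beta0 + A @ beta          # f(X) = beta_0 + sum_k beta_k A_k
print(f.shape)                # one prediction per observation
```

Note the final step is just a linear model in the \(K\) activations.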

Neural networks: activation functions

 

  1. Sigmoid activation:

\[g(z)=\frac{e^z}{1+e^z}=\frac{1}{1+e^{-z}}\]

  2. ReLU (rectified linear unit):

\[g(z)= \begin{cases} 0, & z < 0 \\ z, & \textrm{otherwise} \end{cases}\]

The use of nonlinear activation functions is critical for generating high-quality approximations.
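Both activation functions are one-liners in NumPy:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid: g(z) = 1 / (1 + exp(-z)), maps R into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """ReLU: g(z) = 0 for z < 0, z otherwise."""
    return np.maximum(z, 0.0)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))  # values in (0, 1); sigmoid(0) = 0.5
print(relu(z))     # [0. 0. 2.]
```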

Neural networks: interpretation

 

 

Interpretation of a feed-forward one hidden-layer NN

  • Features \(X_1\ldots X_4\) feed the hidden layer
  • The NN derives five new features by computing \(K=5\) different linear combinations of \(X\)
  • Next, these are plugged into a nonlinear activation function \(g(\cdot)\) to produce \(A_k\)
  • The final model is linear in these derived variables \(A_k\)

Nonlinearity in \(g\) is essential: without it \(f(X)\) would collapse into a linear model.

Estimation (regression)

 

Estimators for \(\theta = \beta_0 \ldots \beta_K, w_{10}\ldots w_{Kp}\) are the solution to the minimization of the penalized loss function

\[\sum_i (y_i-f(x_i))^2 + \lambda R(\beta,w)\]

Main ingredients:

  • Regularizers: LASSO and Ridge penalties

  • Slow learning is achieved in an iterative fashion using stochastic gradient descent (SGD).

Gradient Descent

Let \(f(\theta)\) be the loss function to minimize.

  • Starting from a random guess at \(\theta\), gradient descent takes small steps in the direction opposite to the gradient \(\nabla f(\theta)\) of the loss function.
  • The gradient indicates the direction of maximum increase in loss, hence moving in the opposite direction of the gradient decreases the loss.
  • The size of each step is controlled by the step-size or learning rate \(\eta\).

 

Gradient Descent Algorithm

Initialize: take an initial guess for the parameters \(\theta^0\)

Loop: for \(i=1,\ldots,M\):

  • Compute the gradient: \(\nabla f(\theta^{i-1})\)

  • Update the parameters: \(\theta^i=\theta^{i-1}-\eta \nabla f(\theta^{i-1})\)

  • Continue until parameters converge to a minimum: if \(|f(\theta^i)-f(\theta^{i-1})|<\epsilon\), break

Return \(\theta^i\)

Remarks: Gradient descent uses the entire dataset to compute the gradient. It can be slow for large datasets.
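The loop above can be sketched in a few lines of NumPy. The toy objective \(f(\theta)=\|\theta\|^2\) and the learning rate \(\eta=0.1\) are illustrative choices:

```python
import numpy as np

def gradient_descent(grad, theta0, f, eta=0.1, M=1000, eps=1e-8):
    """Minimize f by taking steps against its gradient, as in the loop above."""
    theta = np.asarray(theta0, dtype=float)
    prev = f(theta)
    for _ in range(M):
        theta = theta - eta * grad(theta)      # theta^i = theta^{i-1} - eta * grad
        cur = f(theta)
        if abs(cur - prev) < eps:              # stop once the loss has converged
            break
        prev = cur
    return theta

# Toy problem: f(theta) = ||theta||^2, gradient 2*theta, minimum at the origin
theta_hat = gradient_descent(lambda t: 2 * t, [3.0, -4.0], f=lambda t: t @ t)
print(theta_hat)  # close to [0, 0]
```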

Stochastic Gradient Descent (SGD)

  • SGD computes gradients on a random sample of the data or batch (often a single observation); a full pass through all batches is called an epoch.

  • SGD scales well to large and even massive datasets because it reduces the gradient computation per step by using only a batch instead of the full dataset.

  • Using subsamples introduces stochasticity, as opposed to computing the full gradient (thus helping avoid local minima but making convergence noisier).

 

Stochastic Gradient Descent Algorithm

Initialize: take an initial guess for the parameters \(\theta^0\)

Loop: for \(i=1,\ldots,M\):

  • Random sampling: select a random sample of points (or one point)

  • Compute the gradient at the selected point(s): \(\nabla f(\theta^{i-1})\)

  • Update the parameters: \(\theta^i=\theta^{i-1}-\eta \nabla f(\theta^{i-1})\)

  • Continue until parameters converge to a minimum: if \(|f(\theta^i)-f(\theta^{i-1})|<\epsilon\), break

Return \(\theta^i\)
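A minimal sketch of this loop for a least-squares problem on simulated data; the batch size, learning rate, and number of epochs are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: y = X @ theta_true + noise
n, p = 500, 3
theta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, p))
y = X @ theta_true + 0.1 * rng.normal(size=n)

def sgd(X, y, eta=0.05, batch=32, epochs=50, seed=0):
    """Minimize mean squared error by SGD over random mini-batches."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):                       # one epoch = full pass over batches
        idx = rng.permutation(len(y))             # random sampling of observations
        for start in range(0, len(y), batch):
            b = idx[start:start + batch]          # indices of the current mini-batch
            resid = X[b] @ theta - y[b]
            grad = 2 * X[b].T @ resid / len(b)    # gradient of MSE on the batch only
            theta = theta - eta * grad            # same update rule as full-batch GD
    return theta

theta_hat = sgd(X, y)
print(theta_hat)  # close to [1.0, -2.0, 0.5]
```

Because each gradient is computed on a small batch, the iterates are noisy but each step is cheap, which is why the method scales to large datasets.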

Further regularization

 

Optimization methods in neural networks offer regularization beyond just penalizing coefficient size. These include:

  • Dropout regularization is a common technique where each neuron is randomly set to zero with a certain probability (e.g., 0.1) during updates.

    • Dropout promotes robustness by encouraging the network to develop redundant neurons that can replicate important functions.
    • This acts as a regularization penalty, enforcing similar weights across groups of neurons, preventing over-reliance on specific ones.
  • Early stopping monitors out-of-sample prediction accuracy alongside the in-sample objective function.

    • Instead of minimizing the in-sample objective, training stops when out-of-sample performance starts to degrade, preventing overfitting by balancing in-sample optimization with generalization.
    • Parameters are updated based on in-sample fit but training halts based on out-of-sample performance.

Source: Causal inference with ML and AI

Deep neural networks

 

  • Model flexibility depends on the number of neurons (width) and layers (depth) in a neural network.
  • More neurons or layers increase flexibility, similar to adding regressors in high-dimensional linear models.
  • Regularization interacts with network size, helping prevent overfitting when using deeper or wider networks.

Modern neural networks typically have more than one hidden layer, and often many units per layer. In theory a single hidden layer with a large number of units has the ability to approximate most functions. However, the learning task of discovering a good solution is made much easier with multiple layers each of modest size.

Tuning neural networks

 

Neural network training requires selecting many tuning parameters, typically chosen using validation methods.

  • Number of hidden layers and units per layer
  • Dropout rate
  • Strength of LASSO/Ridge regularization
  • Details of stochastic gradient descent: batch size and number of epochs

Deep learning in Python

 

Implementations

  • sklearn offers two classes for deep learning: MLPRegressor and MLPClassifier

  • TensorFlow (open-source machine learning framework developed by Google) + Keras (user-friendly interface that runs on top of TensorFlow and simplifies complex tasks like model definition, training, and evaluation)

  • PyTorch (open-source deep learning framework developed by Facebook) + PyTorch Lightning (similar to Keras)
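As a quick illustration with sklearn, the sketch below fits MLPRegressor to a simulated nonlinear target; the architecture, penalty strength, and target function are arbitrary illustrative choices, and it exercises several of the tuning parameters listed earlier (layer sizes, ridge penalty, early stopping):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=1000)  # nonlinear target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

mlp = MLPRegressor(
    hidden_layer_sizes=(32, 32),  # two hidden layers of 32 units each
    activation="relu",
    solver="adam",                # a variant of stochastic gradient descent
    alpha=1e-3,                   # L2 (ridge) penalty on the weights
    early_stopping=True,          # stop when validation score stops improving
    max_iter=500,
    random_state=0,
)
mlp.fit(X_tr, y_tr)
print(mlp.score(X_te, y_te))      # R^2 on held-out data
```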